An amino acid sequence encodes a message that determines the shape
and function of a protein. This message is highly degenerate in that
many different sequences can code for proteins with essentially the
same structure and activity. Comparison of different sequences with
similar messages can reveal key features of the code and improve
understanding of how a protein folds and how it performs its
function.



THE GENOME IS MANIFEST LARGELY IN THE SET OF PROTEINS that it
encodes. It is the ability of these proteins to fold into unique
three-dimensional structures that allows them to function and carry
out the instructions of the genome. Thus, comprehending the rules
that relate amino acid sequence to structure is fundamental to an
understanding of biological processes. Because an amino acid
sequence contains all of the information necessary to determine the
structure of a protein (1), it should be possible to predict
structure from sequence, and subsequently to infer detailed aspects
of function from the structure. However, both problems are extremely
complex, and it seems unlikely that either will be solved in an exact
manner in the near future. It may be possible to obtain approximate
solutions by using experimental data to simplify the problem. In this
article, we describe how an analysis of allowed amino acid
substitutions in proteins can be used to reduce the complexity of
sequences and reveal important aspects of structure and function.



Methods for Studying Tolerance to 
Sequence Variation 

There are two main approaches to studying the tolerance of an 
amino acid sequence to change. The first method relies on the 
process of evolution, in which mutations are either accepted or 
rejected by natural selection. This method has been extremely 
powerful for proteins such as the globins or cytochromes, for which 
sequences from many different species are known (2-7). The second 
approach uses genetic methods to introduce amino acid changes at
specific positions in a cloned gene and uses selections or screens to
identify functional sequences. This approach has been used to great
advantage for proteins that can be expressed in bacteria or yeast,
where the appropriate genetic manipulations are possible (3, 8-11).
The end results of both methods are lists of active sequences that can
be compared and analyzed to identify sequence features that are
essential for folding or function. If a particular property of a side 
chain, such as charge or size, is important at a given position, only 
side chains that have the required property will be allowed. Con- 
versely, if the chemical identity of the side chain is unimportant, 
then many different substitutions will be permitted. 
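
As an illustration of how such a list of active sequences might be reduced to a per-position tolerance profile, the following is a minimal Python sketch. The function name `allowed_substitutions` and the toy sequences are hypothetical and are not taken from any of the cited studies; the sketch assumes the sequences are already aligned and of equal length.

```python
def allowed_substitutions(active_sequences):
    """Return, for each aligned position, the sorted residues seen in active sequences."""
    if not active_sequences:
        return []
    length = len(active_sequences[0])
    if any(len(seq) != length for seq in active_sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [sorted({seq[i] for seq in active_sequences}) for i in range(length)]


# Hypothetical aligned active sequences (illustrative only, not from the article).
active = ["MKVLA", "MRVIA", "MEVLS", "MQVMT"]
for position, residues in enumerate(allowed_substitutions(active), start=1):
    print(position, "".join(residues))
```

A position that prints a single residue is invariant in the set, whereas a position that prints many chemically different residues is tolerant. The sketch records only what was observed, so the result depends on how thoroughly each position was sampled.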

Studies in which these methods were used have revealed that 
proteins are surprisingly tolerant of amino acid substitutions (2-4, 
11). For example, in studying the effects of approximately 1500
single amino acid substitutions at 142 positions in lac repressor,
Miller and co-workers found that about one-half of all substitutions
were phenotypically silent (11). At some positions, many different,
nonconservative substitutions were allowed. Such residue positions 
play little or no role in structure and function. At other positions, no 
substitutions or only conservative substitutions were allowed. These 
residues are the most important for lac repressor activity. 

What roles do invariant and conserved side chains play in 
proteins? Residues that are directly involved in protein functions
such as binding or catalysis will certainly be among the most
conserved. For example, replacing the Asp in the catalytic triad of
trypsin with Asn results in a 10^4-fold reduction in activity (12). A
similar loss of activity occurs in λ repressor when a DNA binding
residue is changed from Asn to Asp (13). To carry out their
function, however, these catalytic residues and binding residues
must be precisely oriented in three dimensions. Consequently,
mutations in residues that are required for structure formation or
stability can also have dramatic effects on activity (10, 14-16).
Hence, many of the residues that are conserved in sets of related 
sequences play structural roles. 



Substitutions at Surface and Buried Positions 

In their initial comparisons of the globin sequences, Perutz and 
co-workers found that most buried residues require nonpolar side 
chains, whereas few features of surface side chains are generally 
conserved (6). Similar results have been seen for a number of protein
families (2, 4, 5, 7, 11, 18). An example of the sequence tolerance at
surface versus buried sites can be seen in Fig. 1, which shows the
allowed substitutions in λ repressor at residue positions that are near
the dimer interface but distant from the DNA binding surface of the 
protein (9). These substitutions were identified by a functional
selection after cassette mutagenesis. A histogram of side chain
solvent accessibility in the crystal structure of the dimer is also
shown in Fig. 1. At six positions, only the wild-type residue or
relatively conservative substitutions are allowed. Five of these
positions are buried in the protein. In contrast, most of the highly
exposed positions tolerate a wide range of chemically different side
chains, including hydrophilic and hydrophobic residues. Hence, it
seems that most of the structural information in this region of the
protein is carried by the residues that are solvent inaccessible.

Fig. 1. (A) Amino acid substitutions allowed in a region of λ
repressor. The wild-type sequence is shown along the center line. The
allowed substitutions shown above each position were identified by
randomly mutating one to [...] codons at a time by using a cassette
method and applying a functional selection (9). (B) The fractional
solvent accessibility (42) of the wild-type side chain in the protein
dimer (43) relative to the same atoms in an Ala-X-Ala model tripeptide.



Constraints on Core Sequences

Because core residue positions appear to be extremely important
for protein folding or stability, we must understand the factors that
determine whether a given core sequence will be acceptable. In general,
only hydrophobic or neutral residues are tolerated at buried sites in
proteins, undoubtedly because of the large favorable contribution of
the hydrophobic effect to protein stability (19). For example, Fig. 2
shows the results of genetic studies used to investigate the
substitutions allowed at residue positions that form the hydrophobic
core of the NH2-terminal domain of λ repressor (20). The acceptable
core sequences are composed almost exclusively of Ala, Cys, Thr, Val,
Ile, Leu, Met, and Phe. The acceptability of many different residues at
each core position presumably reflects the fact that the hydrophobic
effect, unlike hydrogen bonding, does not depend on specific
residue pairings. Although it is possible to imagine a hypothetical
core structure that is stabilized exclusively by residues forming
hydrogen bonds and salt bridges, such a core would probably be
difficult to construct because hydrogen bonds require pairing of
donors and acceptors in an exact geometry. Thus the repertoire of
possible structures that use a polar core would probably be extremely
limited (27). Polar and charged residues are occasionally found in
the cores of proteins, but only at positions where their hydrogen
bonding needs can be satisfied (22).

The cores of most proteins are quite closely packed (23), but some
volume changes are acceptable. In λ repressor, the overall core
volume of acceptable sequences can vary by about 10%. Changes at
individual sites, however, can be considerably larger. For example,
as shown in Fig. 2, both Phe and Ala are allowed at the same core
position in the appropriate sequence contexts. Large volume
changes at individual buried sites have also been observed in
phylogenetic studies, where it has been noted that the size decreases
and increases at interacting residues are not necessarily related in a
simple complementary fashion (5, 7, 17). Rather, local volume
changes are accommodated by conformational changes in nearby
side chains and by a variety of backbone movements.



The Informational Importance of the Core

With occasional exceptions, the core must remain hydrophobic 
and maintain a reasonable packing density. However, since the core 
is composed of side chains that can assume only a limited number of 
conformations (24), efficient packing must be maintained without 
steric clashes. How important are hydrophobicity, volume, and
steric complementarity in determining whether a given sequence can 
form an acceptable core? Each factor is essential in a physical sense, 
as a stable core is probably unable to tolerate unsatisfied hydrogen 
bonding groups, large holes, or steric overlaps (25). However, in an 
informational sense, these factors are not equivalent. For example, in 
experiments in which three core residues of λ repressor were
mutated simultaneously, volume was a relatively unimportant infor- 
mational constraint because three-quarters of all possible combina- 
tions of the 20 naturally occurring amino acids had volumes within 
the range tolerated in the core, and yet most of these sequences were 
unacceptable (20). In contrast, of the sequences that contained only
the appropriate hydrophobic residues, a significant fraction were
acceptable. Hence, the hydrophobicity of a sequence contains
more information about its potential acceptability in the core than
does the total side chain volume. Steric compatibility was
intermediate between volume and hydrophobicity in informational
importance.

Fig. 2. Amino acid substitutions allowed in the core of λ repressor.
The wild-type side chains are shown pictorially in the approximate
orientation seen in the crystal structure (43). The lists of allowed
substitutions at each position are shown below the wild-type side
chains. These substitutions were identified by randomly mutating one
to four residues at a time by using a cassette method and applying a
functional selection (20). Not all substitutions are allowed in every
sequence background.



The Informational Importance of Surface Sites 

We have noted that many surface sites can tolerate a wide variety 
of side chains, including hydrophilic and hydrophobic residues. This 
result might be taken to indicate that surface positions contain little 
structural information. However, Bashford et al., in an extensive
analysis of globin sequences (4), found a strong bias against large 
hydrophobic residues at many surface positions. At one level, this 
may reflect constraints imposed by protein solubility, because large 
patches of hydrophobic surface residues would presumably lead to 
aggregation. At a more fundamental level, protein folding requires a 
partitioning between surface and buried positions. Consequently, to
achieve a unique native state without significant competition from 
other conformations, it may be important that some sites have a 
decided preference for exterior rather than interior positions. As a 
result, many surface sites can accept hydrophobic residues individ- 
ually, but the surface as a whole can probably tolerate only a 
moderate number of hydrophobic side chains. 



Identification of Residue Roles from 
Sets of Sequences 

Often, a protein of interest is a member of a family of related 
sequences. What can we infer from the pattern of allowed
substitutions at positions in sets of aligned sequences generated by genetic
or phylogenetic methods? Residue positions that can accept a 
number of different side chains, including charged and highly polar 
residues, are almost certain to be on the protein surface. Residue 
positions that remain hydrophobic, whether variable or not, are 
likely to be buried within the structure. In Fig. 3, those residue 
positions in λ repressor that can accept hydrophilic side chains are
shown in orange and those that cannot accept hydrophilic side 
chains are shown in green. The obligate hydrophobic positions 
define the core of the structure, whereas positions that can accept 
hydrophilic side chains define the surface. 
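
The surface versus core assignment described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the set of residues treated as hydrophilic is borrowed from the low-hydrophobicity group in the Fig. 4 legend (Gln, Asn, Glu, Asp, Lys, Arg), and the names `HYDROPHILIC`, `classify_position`, and `classify_positions`, as well as the example columns, are hypothetical.

```python
# Residues treated as hydrophilic: the low-hydrophobicity group (Gln, Asn,
# Glu, Asp, Lys, Arg) from the Fig. 4 legend. Using this group as the
# cutoff is an assumption made for illustration.
HYDROPHILIC = set("QNEDKR")


def classify_position(allowed_residues):
    """Label a position 'surface' if any hydrophilic residue is tolerated, else 'core'."""
    return "surface" if HYDROPHILIC & set(allowed_residues) else "core"


def classify_positions(columns):
    """Classify every position given its string (or set) of allowed residues."""
    return [classify_position(column) for column in columns]


# Hypothetical allowed-residue lists for four positions.
columns = ["LIVM", "KEAS", "F", "AQT"]
print(classify_positions(columns))
```

The labels are only candidates. An invariant hydrophilic residue, for example, would be labeled "surface" here even if it is conserved for functional reasons, and the Fig. 4 legend notes that Arg and Lys can occasionally substitute for nonpolar residues at buried sites.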

Functionally important residues should be conserved in sets of 
active sequences, but it is not possible to decide whether a side chain 
is functionally or structurally important just because it is invariant or 
conserved. To make this distinction requires an independent assay of 
protein folding. The ability of a mutant protein to maintain a stably 
folded structure can often be measured by biophysical techniques, 
by susceptibility to intracellular proteolysis (26), or by binding to
antibodies specific for the native structure (27, 28). In the latter 
cases, it is possible to screen proteins in mutated clones for the 
ability to fold even if these proteins are inactive. Sets of sequences 
that allow formation of a stable structure can then be compared to 
the sets that allow both folding and function, with the active site or 
binding residues being those that are variable in the set of stable 
proteins but invariant in the set of functional proteins. The DNA- 
binding residues of Arc repressor were identified by this method (8). 
The receptor-binding residues of human growth hormone were also
identified by comparing the stabilities and activities of a set of 
mutant sequences (28). However, in this case, the mutants were 
generated as hybrid sequences between growth hormone and related 
hormones with different binding specificities. 
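
The comparison between the set of stably folded sequences and the set of functional sequences can be expressed as a simple set operation per position. The Python sketch below is a hedged illustration: the function names and the toy alignments are hypothetical, and both sets are assumed to be aligned to the same length.

```python
def column_sets(sequences):
    """Return the set of residues observed at each aligned position."""
    return [set(column) for column in zip(*sequences)]


def functional_positions(stable_seqs, functional_seqs):
    """Return 1-based positions that vary among stable sequences but are
    invariant among functional sequences."""
    lengths = {len(seq) for seq in list(stable_seqs) + list(functional_seqs)}
    if len(lengths) > 1:
        raise ValueError("all sequences must be aligned to the same length")
    stable = column_sets(stable_seqs)
    functional = column_sets(functional_seqs)
    return [
        i
        for i, (s, f) in enumerate(zip(stable, functional), start=1)
        if len(f) == 1 and len(s) > 1
    ]


# Hypothetical data: sequences that fold stably, and the subset that is also active.
stable = ["ARND", "ARNE", "AKND", "GRNK"]
functional = ["ARND", "AKND"]
print(functional_positions(stable, functional))
```

Positions that are invariant in both sets (position 3 in the toy data) are not reported, because the comparison alone cannot say whether such a residue is needed for folding, for function, or for both.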



Implications for Structure Prediction

At present, the only reliable method for predicting a low- 
resolution tertiary structure of a new protein is by identifying 
sequence similarity to a protein whose structure is already known 
(29, 30). However, it is often difficult to align sequences as the level 
of sequence similarity decreases, and it is sometimes impossible to 
detect statistically significant sequence similarity between distantly 
related proteins. Because the number of known sequences is far 
greater than the number of known structures, it would be advanta- 
geous to increase the reach of the available structural information by 
improving methods for detecting distant sequence relations and for 
subsequently aligning these sequences based on structural principles. 
In a normal homology search, the sequence database is scanned with 
a single test sequence, and every residue must be weighted equally. 
However, some residues are more important than others and should
be weighted accordingly. Moreover, certain regions of the protein
are more likely to contain gaps than others. Both kinds of
information can be obtained from sequence sets, and several techniques
have been used to combine such information into more appropriately
weighted sequence searches and alignments (31). These methods
were used to align the sequences of retroviral proteases with aspartic
proteases, which in turn allowed construction of a three-dimensional
model for the protease of human immunodeficiency virus type 1
(29). Comparison with the recently determined crystal structure of
this protein revealed reasonable agreement in many areas of the
predicted structure (32).

Fig. 3. Tolerance of positions in the NH2-terminal domain of λ
repressor to hydrophilic side chains. The complex (43) of the
repressor dimer (blue) and operator DNA (white) is shown. In (A),
positions that can tolerate hydrophilic side chains are shown in
orange. The same side chains are shown in (B) without the remaining
protein atoms. In (C), positions that require hydrophobic or neutral
side chains are shown in green. These side chains are shown in (D)
without the remaining protein atoms. About three-fourths of the 92
side chains in the NH2-terminal domain are included in both (B) and
(D). The remaining positions have not been tested. Data are from (9,
[...], 27, 44).
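
The idea of weighting residues by their importance in a search can be illustrated with a small Python sketch. This is not the method of (31). It is an assumed, simplified scheme in which each alignment column receives a weight of one minus its normalized Shannon entropy, so that invariant positions count most, and gaps are ignored entirely. The names `position_weights` and `weighted_score` and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def position_weights(alignment):
    """Weight each column by 1 - H / log2(20), where H is the Shannon entropy
    of the residues observed in that column (an illustrative choice)."""
    weights = []
    for column in zip(*alignment):
        total = len(column)
        counts = Counter(column)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        weights.append(1.0 - entropy / math.log2(20))
    return weights


def weighted_score(query, alignment):
    """Sum the weights of the positions where the query residue is one of the
    residues observed in the aligned set. No gaps are modeled."""
    weights = position_weights(alignment)
    allowed = [set(column) for column in zip(*alignment)]
    return sum(w for w, ok, q in zip(weights, allowed, query) if q in ok)


# Hypothetical aligned family of four sequences.
family = ["ACD", "ACE", "AGD", "ACD"]
print(position_weights(family))
print(weighted_score("ACD", family), weighted_score("WCD", family))
```

In this toy scheme a mismatch at the invariant first position costs more than a mismatch at a variable position, which is the qualitative behavior the text calls for. A practical search would also need substitution scores and position-dependent gap penalties.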

The structural information at most surface sites is highly degener- 
ate. Except for functionally important residues, exterior positions 
seem to be important chiefly in maintaining a reasonably polar
surface. The information contained in buried residues is also 
degenerate, the main requirement being that these residues remain 
hydrophobic. Thus, at its most basic level, the key structural 
message in an amino acid sequence may reside in its specific pattern
of hydrophobic and hydrophilic residues. This is meant in an 
informational sense. Clearly, the precise structure and stability of a 
protein depends on a large number of detailed interactions. It is 
possible, however, that structural prediction at a more primitive 
level can be accomplished by concentrating on the most basic 
informational aspects of an amino acid sequence. For example, 
amphipathic patterns can be extracted from aligned sets of sequences 
and used, in some cases, to identify secondary structures. 

If a region of secondary structure is packed against the hydrophobic
core, a pattern of hydrophobic residues reflecting the periodicity
of the secondary structure is expected (33, 34). These patterns can be
obscured in individual sequences by hydrophobic residues on the
protein surface. It is rare, however, for a surface position to remain
hydrophobic over the course of evolution. Consequently, the
amphipathic patterns expected for simple secondary structures can be
much clearer in a set of related sequences (6). This principle is
illustrated in Fig. 4, which shows helical hydrophobic moment plots
for the Antennapedia homeodomain sequence (Fig. 4A) and for a
composite sequence derived from a set of homologous homeodomain
proteins (Fig. 4B) (35). The hydrophobic moment is a simple
measure of the degree of amphipathic character of a sequence in a
given secondary structure (34). The amphipathic character of the
three α-helical regions in the Antennapedia protein (36) is clearly
revealed only by the analysis of the combined set of homeodomain
sequences. The secondary structure of Arc repressor, a small
DNA-binding protein, was recently predicted by a similar method (8)
and confirmed by nuclear magnetic resonance studies (37).
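
A helical hydrophobic moment profile of the kind plotted in Fig. 4 can be sketched in Python as follows. The sketch takes the three hydrophobicity groups, the values 1, 0, and -1, the eight-residue window, and the 100° angle per residue from the Fig. 4 legend. It computes the moment as the magnitude of the vector sum over the window, which is the usual definition but is assumed here rather than confirmed by the source. The composite-sequence step described in the legend is not implemented, and all function names are hypothetical.

```python
import math

# Hydrophobicity groups from the Fig. 4 legend (one-letter codes).
H1 = set("WIFLMVC")  # high hydrophobicity   -> +1
H2 = set("YPATHGS")  # medium hydrophobicity ->  0
H3 = set("QNEDKR")   # low hydrophobicity    -> -1


def group_value(residue):
    """Map a residue to 1, 0, or -1 according to its hydrophobicity group."""
    if residue in H1:
        return 1
    if residue in H2:
        return 0
    if residue in H3:
        return -1
    raise ValueError(f"unknown residue: {residue!r}")


def hydrophobic_moment(window, delta_deg=100.0):
    """Magnitude of the vector sum of residue values placed every delta_deg degrees."""
    delta = math.radians(delta_deg)
    sx = sum(group_value(r) * math.cos(i * delta) for i, r in enumerate(window))
    sy = sum(group_value(r) * math.sin(i * delta) for i, r in enumerate(window))
    return math.hypot(sx, sy)


def moment_profile(sequence, window=8, delta_deg=100.0):
    """Hydrophobic moment for every full window along the sequence."""
    return [
        hydrophobic_moment(sequence[i:i + window], delta_deg)
        for i in range(len(sequence) - window + 1)
    ]


# Hypothetical amphipathic test sequence (not from the article).
print(moment_profile("LKKLLKKLLKKL"))
```

A window in which hydrophobic residues recur with the helical period gives a large moment, whereas a uniformly hydrophobic or uniformly polar window gives a small one, which is why the measure highlights amphipathic helices.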

The specific pattern of hydrophobic and hydrophilic residues in 
an amino acid sequence must limit the number of different structures
a given sequence can adopt and may indeed define its overall fold. If
this is true, then the arrangement of hydrophobic and hydrophilic
residues should be a characteristic feature of a particular fold. Sweet
and Eisenberg have shown that the correlation of the pattern of
hydrophobicity between two protein sequences is a good criterion
for their structural relatedness (38). In addition, several studies
indicate that patterns of obligatory hydrophobic positions identified
from aligned sequences are distinctive features of sequences that
adopt the same structure (29, 38, 39). Thus, the order of
hydrophobic and hydrophilic residues in a sequence may actually be
sufficient information to determine the basic folding pattern of a
protein sequence.

Although the pattern of sequence hydrophobicity may be a
characteristic feature of a particular fold, it is not yet clear how such
patterns could be used for prediction of structure de novo. It is
important to understand how patterns in sequence space can be
related to structures in conformation space. Lau and Dill have
approached this problem by studying the properties of simple
sequences composed only of H (hydrophobic) and P (polar) groups
on two-dimensional lattices (40). An example of such a representation
is shown in Fig. 5. Residues adjacent in the sequence must
occupy adjacent squares on the lattice, and two residues cannot
occupy the same space. Free energies of particular conformations are
evaluated with a single term, an attraction of H groups. By
considering chains of ten residues, an exhaustive conformational
search for all 1024 possible sequences of H and P residues was
possible. For longer sequences only a representative fraction of the
allowed sequence or conformation space could be explored. The
significant results were as follows: (i) not all sequences can fold into
a "native" structure and only a few sequences form a unique native
structure; (ii) the probability that a sequence will adopt a unique
native structure increases with chain length; and (iii) the native
states are compact, contain a hydrophobic core surrounded by polar
residues, and contain significant secondary structure. Although the
gap between these two-dimensional simulations and three-dimensional
structures is large, the use of simple rules and sequence
representations yields results similar to those expected for real
proteins. Three-dimensional lattice methods are also beginning to
be developed and evaluated (41).
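
The lattice rules stated in this paragraph (adjacent residues on adjacent squares, no double occupancy, and a single attractive term between H groups) are simple enough to enumerate exhaustively for short chains. The Python sketch below is an illustration under assumptions, not a reproduction of the calculations in (40): it counts contacts between H residues that are lattice neighbors but not sequence neighbors, fixes the first bond to remove rotational copies, and still counts mirror-image conformations separately. All names are hypothetical.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def conformations(n):
    """Yield self-avoiding walks of n residues on a square lattice.
    The first bond is fixed along +x; mirror images are not removed."""
    if n < 2:
        yield tuple((i, 0) for i in range(n))
        return
    path = [(0, 0), (1, 0)]
    occupied = set(path)

    def extend():
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            site = (x + dx, y + dy)
            if site not in occupied:
                path.append(site)
                occupied.add(site)
                yield from extend()
                path.pop()
                occupied.remove(site)

    yield from extend()


def hh_contacts(sequence, path):
    """Count H-H pairs that are lattice neighbors but not adjacent in sequence."""
    index = {site: i for i, site in enumerate(path)}
    contacts = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):  # look right and up so each pair is counted once
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                contacts += 1
    return contacts


def ground_states(sequence):
    """Return (maximum number of H-H contacts, number of conformations achieving it)."""
    best, count = -1, 0
    for path in conformations(len(sequence)):
        c = hh_contacts(sequence, path)
        if c > best:
            best, count = c, 1
        elif c == best:
            count += 1
    return best, count


print(ground_states("HPPH"))
print(2 ** 10)  # number of H/P sequences of length ten, as stated in the text
```

In this sketch a sequence whose best contact count is reached by only one conformation and its mirror image would correspond to a unique lowest-energy shape. Chains of ten residues remain tractable by this brute-force approach, but the number of walks grows rapidly with length.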



Summary 

There is more information in a set of related sequences than in a 
single sequence. A number of practical applications arise from an 
analysis of the tolerance of residue positions to change. First, such 
information permits the evaluation of a residue's importance to the 
function and stability of a protein. This ability to identify the 
essential elements of a protein sequence may improve our under- 
standing of the determinants of protein folding and stability as well 
as protein function. Second, patterns of tolerance to amino acid
substitutions of varying hydrophilicity can help to identify residues
likely to be buried in a protein structure and those likely to occupy
surface positions. The amphipathic patterns that emerge can be used
to identify probable regions of secondary structure. Third,
incorporating a knowledge of allowed substitutions can improve the
ability to detect and align distantly related proteins because the
key residues can be given prominence in the alignment scoring.

As more sequences are determined, it becomes increasingly likely
that a protein of interest is a member of a family of related
sequences. If this is not the case, it is now possible to use genetic
methods to generate lists of allowed amino acid substitutions.
Consequently, at least in the short term, it may not be necessary to
solve the folding problem for individual protein sequences. Instead,
information from sequence sets could be used. Perhaps by simplifying
sequence space through the identification of key residues, and by
simplifying conformation space as in the lattice methods, it will be
possible to develop algorithms to generate a limited number of trial
structures. These trial structures could then, in turn, be evaluated by
further experiments and more sophisticated energy calculations.

Fig. 4. Helical hydrophobic moments calculated by using (A) the
Antennapedia homeodomain sequence or (B) a set of 39 aligned
homeodomain sequences (35). The bars indicate the extent of the
helical regions identified in nuclear magnetic resonance studies of
the Antennapedia homeodomain (36). To determine hydrophobic moments,
residues were assigned to one of three groups: H1 (high
hydrophobicity = Trp, Ile, Phe, Leu, Met, Val, or Cys); H2 (medium
hydrophobicity = Tyr, Pro, Ala, Thr, His, Gly, or Ser); and H3 (low
hydrophobicity = Gln, Asn, Glu, Asp, Lys, or Arg). For the aligned
homeodomain sequences, the residues at each position were sorted by
their hydrophobicity by using the scale of Fauchère and Pliska (45).
Arg and Lys were not counted unless no other residue was found at the
position, because they contain long aliphatic side chains and can
thereby substitute for nonpolar residues at some buried sites. To
account for possible sequence errors and rare exceptions, the most
hydrophilic residue allowed at each position was discarded unless it
was observed twice. The second most hydrophilic residue was then
chosen to represent the hydrophobicity of each position. An
eight-residue window was used and the vectors projected radially
every 100°. The vector magnitudes were assigned a value of 1, 0, or
-1 for positions where the hydrophobicity group was H1, H2, or H3,
respectively.

Fig. 5. A representation of one compact conformation for a particular
sequence of H and P residues on a two-dimensional square lattice.
[Adapted from (40), with permission of the American Chemical Society]